We're going to look at shootings and homicides from the Tribune's internal database and group them by community area. This expands on the example in the "Shootings and homicides within the Austin community area" notebook: it covers all community areas and uses a spatial index, so we don't have to test every community area against each incident.

First, we need to get the community area boundaries.


In [1]:
import requests

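def get_chicago_community_areas():
    """Download the community area boundaries as a GeoJSON dict."""
    url = 'https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON'
    # NOTE: verify=False skips TLS certificate verification, which is what
    # triggers the InsecureRequestWarning below. Drop it if the certificate validates.
    resp = requests.get(url, verify=False)
    return resp.json()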

community_areas = get_chicago_community_areas()


/Users/ghing/venvs/public-notebooks/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py:791: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)

Now, let's convert the GeoJSON dicts to shapes that we can use to look up which community area a shooting or homicide is in.


In [2]:
from shapely.geometry import shape
# Get the shapes as a map between community area number and shape as we'll need the IDs anyway to build our index later
community_area_shapes = {int(f['properties']['area_num_1']): shape(f['geometry']) for f in community_areas['features']}
community_area_properties = {int(f['properties']['area_num_1']): f['properties'] for f in community_areas['features']}

Build a spatial index of the community areas.


In [3]:
from rtree import index

communty_area_index = index.Index()
for ca_number, ca_shape in community_area_shapes.items():
    communty_area_index.add(ca_number, ca_shape.bounds, obj=community_area_properties[ca_number])
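The index above stores only each shape's bounding box, which shapely reports through `.bounds` as (minx, miny, maxx, maxy). Here is a minimal sketch of what that tuple means for longitude/latitude data, using plain Python instead of shapely. The `ring_bounds` helper and the sample ring are hypothetical and only for illustration.

```python
def ring_bounds(ring):
    """Return (left, bottom, right, top) for a list of (lng, lat) pairs."""
    lngs = [lng for lng, lat in ring]
    lats = [lat for lng, lat in ring]
    # left/right are the min/max longitude, bottom/top the min/max latitude
    return (min(lngs), min(lats), max(lngs), max(lats))

# Hypothetical rectangle near Chicago, as (lng, lat) pairs
ring = [(-87.74, 41.88), (-87.70, 41.88), (-87.70, 41.92), (-87.74, 41.92)]
bounds = ring_bounds(ring)
print(bounds)
```

Because longitude is the x axis, "left" and "right" are longitudes and "bottom" and "top" are latitudes, which is the opposite of the usual "lat, lng" reading order.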

Let's spot check our index, because the bounding box coordinate order, (left, bottom, right, top), is a little confusing to me.


In [4]:
from shapely.geometry import Point

def point_to_bounds(point):
    """
    Convert a point to a bounding box
    
    It makes sense to represent points as an x,y pair, but RTree only operates
    on bounding boxes. Convert the point to a bounding box where left == right
    and top == bottom.

    """
    return (point[0], point[1], point[0], point[1])

def get_community_area(point, ca_idx, ca_shapes):
    areas = []
    for n in ca_idx.intersection(point_to_bounds(point), objects=True):
        ca_number = int(n.object['area_num_1'])
        ca_shape = ca_shapes[ca_number]
        if ca_shape.contains(Point(*point)):
            areas.append(n.object)
    return areas
        
# Turkey Chop is a restaurant that is most definitely in Humboldt Park
# Let's use it to spot-check our index
turkey_chop_coords = [-87.7141142377237, 41.8955710581678]

turkey_chop_ca = get_community_area(turkey_chop_coords, communty_area_index, community_area_shapes)
assert turkey_chop_ca[0]['community'] == "HUMBOLDT PARK"
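`get_community_area` works in two stages: the index returns every area whose bounding box covers the point, and then `contains` throws out the candidates where the point is inside the box but outside the actual shape. The sketch below shows why the second stage is needed. It is a pure Python stand-in, not how rtree or shapely work internally, and the helper names and the triangle are hypothetical.

```python
def bbox_contains(bounds, point):
    """Cheap first-stage test against a (left, bottom, right, top) box."""
    left, bottom, right, top = bounds
    x, y = point
    return left <= x <= right and bottom <= y <= top

def polygon_contains(ring, point):
    """Exact second-stage test: ray casting over a simple ring of (x, y) pairs."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the point's y value can cross the ray
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical triangle; its bounding box covers more area than the shape
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
triangle_bounds = (0.0, 0.0, 4.0, 4.0)

in_box_only = (3.0, 3.0)  # inside the box, outside the triangle
in_shape = (1.0, 1.0)     # inside both

for p in (in_box_only, in_shape):
    print(p, bbox_contains(triangle_bounds, p), polygon_contains(triangle, p))
```

The bounding box test alone would wrongly accept the first point, which is why neighboring community areas with overlapping boxes still need the exact check.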

Now, let's get some data from NewsroomDB, the Tribune's internal database of homicides and shootings.


In [5]:
import os
import requests

# Some constants
NEWSROOMDB_URL = os.environ['NEWSROOMDB_URL']

# A big object to hold all our data between steps
data = {}

def get_table_url(table_name, base_url=NEWSROOMDB_URL):
    return '{}table/json/{}'.format(base_url, table_name)

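def get_table_data(table_name, max_retries=5):
    """Fetch a table as JSON, retrying a bounded number of times on failure."""
    url = get_table_url(table_name)

    for attempt in range(max_retries):
        try:
            r = requests.get(url)
            r.raise_for_status()
            return r.json()
        except (requests.RequestException, ValueError):
            # Large responses sometimes fail or fail to decode, so try again
            print("Request failed. Probably because the response is huge.  We should fix this.")

    raise RuntimeError("Could not load table {} after {} attempts".format(table_name, max_retries))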

data['shooting_victims'] = get_table_data('shootings')
print("Loaded {} shooting victims".format(len(data['shooting_victims'])))

data['homicides'] = get_table_data('homicides')
print("Loaded {} homicides".format(len(data['homicides'])))


Request failed. Probably because the response is huge.  We should fix this.
Request failed. Probably because the response is huge.  We should fix this.
Loaded 11586 shooting victims
Loaded 1542 homicides

Let's create pandas DataFrames out of the loaded data.


In [6]:
import pandas as pd
import numpy as np

data['shooting_victims_df'] = pd.DataFrame(data['shooting_victims'])
data['homicides_df'] = pd.DataFrame(data['homicides'])

Parse the date fields into Python date objects for easier analysis and make separate month and year columns to make grouping easier.


In [7]:
from datetime import datetime

def parse_date(s):
    try:
        return datetime.strptime(s, '%Y-%m-%d').date()
    except ValueError:
        return None
    
data['shooting_victims_df']['Date'] = data['shooting_victims_df']['Date'].apply(parse_date)
data['shooting_victims_df']['month'] = data['shooting_victims_df']['Date'].apply(lambda x: x.month if x else None)
data['shooting_victims_df']['year'] = data['shooting_victims_df']['Date'].apply(lambda x: x.year if x else None)
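To see what the parsing step does with messy values, here is a small sketch. `parse_date_safe` is a hypothetical variant of `parse_date` that also catches `TypeError`, on the assumption that some records might have `None` rather than an empty string for the date. The sample strings are made up.

```python
from datetime import datetime

def parse_date_safe(s):
    """Parse 'YYYY-MM-DD', returning None for missing or malformed values."""
    try:
        return datetime.strptime(s, '%Y-%m-%d').date()
    except (ValueError, TypeError):
        # ValueError: wrong format or empty string; TypeError: not a string
        return None

# Hypothetical inputs: one good date and three kinds of bad ones
samples = ['2016-03-14', '', None, '03/14/2016']
parsed = [parse_date_safe(s) for s in samples]
months = [d.month if d else None for d in parsed]
print(parsed)
print(months)
```

Only the first sample yields a date. The rest become `None`, so they end up without a month or year and drop out of any grouping on those columns.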

We'll start with shootings. Assign each shooting to a community area using the index we built earlier.


In [8]:
import pprint
import re

def parse_coordinates(coordinate_str):
    """Convert a lat, lng string to a pair of lng, lat floats"""
    lat, lng = [float(c) for c in re.sub(r'[\(\) ]', '', coordinate_str).split(',')]
    return lng, lat

shooting_victim_community_areas = {}

for victim in data['shooting_victims']:
    try:
        coords = parse_coordinates(victim['Geocode Override'])
    except ValueError:
        shooting_victim_community_areas[victim['_id']] = '__invalid__'
        continue
        
    ca = get_community_area(coords, communty_area_index, community_area_shapes)
    
    if len(ca) == 0:
        shooting_victim_community_areas[victim['_id']] = '__invalid__'
        print("No community area found for record with coordinates {}".format(coords))
    elif len(ca) > 1:
        raise ValueError("Multiple community areas found for record with coordinates {}".format(coords))
    else:
        shooting_victim_community_areas[victim['_id']] = ca[0]['community']
        
data['shooting_victim_community_areas'] = pd.DataFrame([{'_id': k, 'community': v} for k, v in shooting_victim_community_areas.items()])


No community area found for record with coordinates (-87.742872, 41.762969)
No community area found for record with coordinates (-87.742872, 41.762969)
No community area found for record with coordinates (-87.690307, 41.730243)
No community area found for record with coordinates (-87.651214, 41.511413)
No community area found for record with coordinates (-87.812812, 41.911125)
No community area found for record with coordinates (-87.930351, 41.958801)
No community area found for record with coordinates (-87.682444, 41.730165)
No community area found for record with coordinates (-87.652854, 41.508438)
No community area found for record with coordinates (-87.627502, 41.504604)
No community area found for record with coordinates (-87.700546, 42.019557)
No community area found for record with coordinates (-84.5535506308079, 41.6678441315889)
No community area found for record with coordinates (-87.700546, 42.019557)
No community area found for record with coordinates (-95.9222953766584, 35.9909527748823)
No community area found for record with coordinates (-87.762807789495, 41.8110412403273)
No community area found for record with coordinates (-87.81238630414009, 41.95269003510475)
No community area found for record with coordinates (-97.94587723910809, 35.53741604089737)
No community area found for record with coordinates (-87.85710543394089, 42.867526486516)
No community area found for record with coordinates (-87.85710543394089, 42.867526486516)
No community area found for record with coordinates (-117.07955932617188, 32.69026184082031)
No community area found for record with coordinates (-88.18666309118271, 41.70995280146599)
No community area found for record with coordinates (-118.30857849121094, 33.802249908447266)
No community area found for record with coordinates (-89.16730619966984, 45.15423908829689)
No community area found for record with coordinates (-87.87754821777344, 42.09328079223633)
No community area found for record with coordinates (-119.80324544012547, 39.53507088124752)
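The order swap in `parse_coordinates` is easy to miss: the stored string is latitude first, but shapely and the index want x (longitude) first. A quick sketch of its behavior follows. The `'(lat, lng)'` input format is an assumption inferred from the regular expression, and the sample values are the Turkey Chop coordinates from earlier, not a database record.

```python
import re

def parse_coordinates(coordinate_str):
    """Convert a lat, lng string to a pair of lng, lat floats"""
    lat, lng = [float(c) for c in re.sub(r'[\(\) ]', '', coordinate_str).split(',')]
    return lng, lat

# Assumed input format: "(lat, lng)"
coords = parse_coordinates('(41.8955710581678, -87.7141142377237)')
print(coords)

# Both an empty string and a single value raise ValueError,
# which is what the loop above catches to mark a record as invalid
errors = []
for bad in ('', '41.9'):
    try:
        parse_coordinates(bad)
    except ValueError:
        errors.append(bad)
print(errors)
```

Records that parse fine but land outside every community area, like the ones printed above, are a separate case: they are also labeled `__invalid__`, but only after the lookup comes back empty.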

Join the community areas to the shooting victims data.


In [9]:
data['shooting_victims_df__with_ca'] = data['shooting_victims_df'].merge(
    data['shooting_victim_community_areas'],
    how='left',
    on='_id')

And count the victims by community area, year and month.


In [32]:
data['shooting_victims_by_ca'] = pd.DataFrame(data['shooting_victims_df__with_ca'].groupby(['community', 'year', 'month']).size())
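To make explicit what `groupby([...]).size()` computes, here is the same count done without pandas, using `collections.Counter` keyed on a (community, year, month) tuple. This is a stand-in for illustration, not the approach the notebook uses, and the records are hypothetical.

```python
from collections import Counter

# Hypothetical records, not taken from the database
records = [
    {'community': 'AUSTIN', 'year': 2016, 'month': 3},
    {'community': 'AUSTIN', 'year': 2016, 'month': 3},
    {'community': 'AUSTIN', 'year': 2016, 'month': 2},
    {'community': 'HUMBOLDT PARK', 'year': 2016, 'month': 3},
]

# One counter entry per distinct (community, year, month) combination
counts = Counter((r['community'], r['year'], r['month']) for r in records)

# Keep only March 2016 and key the result by community
march_2016 = {key[0]: n for key, n in counts.items() if key[1:] == (2016, 3)}
print(march_2016)
```

Each distinct tuple corresponds to one row of the grouped DataFrame, and the tuple's three parts correspond to the three levels of its index.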

Let's just look at March 2016 shooting victims.


In [33]:
df = data['shooting_victims_by_ca']
df[(df.index.get_level_values('year') == 2016) & (df.index.get_level_values('month') == 3)].sort_values(by=0, ascending=False)


Out[33]:
community               year    month   0
AUSTIN                  2016.0  3.0    36
HUMBOLDT PARK           2016.0  3.0    27
WEST ENGLEWOOD          2016.0  3.0    23
NORTH LAWNDALE          2016.0  3.0    18
WEST GARFIELD PARK      2016.0  3.0    15
EAST GARFIELD PARK      2016.0  3.0    13
NEW CITY                2016.0  3.0    12
AUBURN GRESHAM          2016.0  3.0    11
ENGLEWOOD               2016.0  3.0    11
SOUTH LAWNDALE          2016.0  3.0     9
ROSELAND                2016.0  3.0     9
GREATER GRAND CROSSING  2016.0  3.0     8
CHICAGO LAWN            2016.0  3.0     8
WASHINGTON HEIGHTS      2016.0  3.0     6
WEST PULLMAN            2016.0  3.0     6
WOODLAWN                2016.0  3.0     6
CHATHAM                 2016.0  3.0     6
SOUTH DEERING           2016.0  3.0     5
NEAR WEST SIDE          2016.0  3.0     5
SOUTH SHORE             2016.0  3.0     4
GRAND BOULEVARD         2016.0  3.0     4
WEST TOWN               2016.0  3.0     4
MORGAN PARK             2016.0  3.0     4
BELMONT CRAGIN          2016.0  3.0     4
__invalid__             2016.0  3.0     4
UPTOWN                  2016.0  3.0     3
ROGERS PARK             2016.0  3.0     3
ALBANY PARK             2016.0  3.0     3
NEAR NORTH SIDE         2016.0  3.0     3
CALUMET HEIGHTS         2016.0  3.0     3
IRVING PARK             2016.0  3.0     2
AVONDALE                2016.0  3.0     2
BRIGHTON PARK           2016.0  3.0     2
WASHINGTON PARK         2016.0  3.0     2
EAST SIDE               2016.0  3.0     2
HERMOSA                 2016.0  3.0     2
GAGE PARK               2016.0  3.0     2
LOWER WEST SIDE         2016.0  3.0     2
LOOP                    2016.0  3.0     1
SOUTH CHICAGO           2016.0  3.0     1
NORWOOD PARK            2016.0  3.0     1
MCKINLEY PARK           2016.0  3.0     1
DOUGLAS                 2016.0  3.0     1
RIVERDALE               2016.0  3.0     1
WEST LAWN               2016.0  3.0     1
PULLMAN                 2016.0  3.0     1
WEST RIDGE              2016.0  3.0     1
BRIDGEPORT              2016.0  3.0     1
OAKLAND                 2016.0  3.0     1
ASHBURN                 2016.0  3.0     1

How did March shootings in Humboldt Park look over time?


In [34]:
df = data['shooting_victims_by_ca']
df[(df.index.get_level_values('community') == "HUMBOLDT PARK") & (df.index.get_level_values('month') == 3)]


Out[34]:
community               year    month   0
HUMBOLDT PARK           2012.0  3.0     8
                        2013.0  3.0     5
                        2014.0  3.0     6
                        2015.0  3.0     6
                        2016.0  3.0    27
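The year and month levels print as 2016.0 and 3.0 rather than 2016 and 3. That is most likely because records with an unparseable date get `None` for both columns, and pandas stores a numeric column with missing values as float64. One possible fix, sketched below on hypothetical rows, is to cast to the nullable `Int64` dtype before grouping. This is an assumption about the cause, not something verified against the real data.

```python
import pandas as pd

# Hypothetical rows; the None values mimic a record whose date failed to parse
df = pd.DataFrame({
    'community': ['AUSTIN', 'AUSTIN', 'HUMBOLDT PARK', 'AUSTIN'],
    'year': [2016, 2016, 2016, None],
    'month': [3, 3, 3, None],
})
dtype_before = str(df['year'].dtype)  # missing value forces a float column

# Nullable integers keep the missing value without converting to float
df['year'] = df['year'].astype('Int64')
df['month'] = df['month'].astype('Int64')
dtype_after = str(df['year'].dtype)

# Rows with a missing year or month are dropped by groupby by default
counts = df.groupby(['community', 'year', 'month']).size().reset_index(name='victims')
print(dtype_before, dtype_after)
print(counts)
```

With integer levels, the filters on `year == 2016` and `month == 3` keep working unchanged, and the output no longer shows a trailing `.0`.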

In [ ]: